Abstract

Motivation: The analysis of physiological processes over time is becoming increasingly 
important. The measurements are often given by spectrometric or gene expression profiles 
over time with only few time points but a large number of measured variables. The analysis 
of such temporal sequences is challenging and only few methods have been proposed. The 
information can be encoded independently of time, by means of classical expression differences
for a single time point, or in expression profiles over time. Available methods are limited
to unsupervised and semi-supervised settings. The predictive variables can be identified 
only by means of wrapper or post-processing techniques. This is complicated due to the 
small number of samples in such studies. Here, we present a supervised learning approach,
termed Supervised Generative Topographic Mapping Through Time (SGTM-TT). It learns a supervised
mapping of the temporal sequences onto a low-dimensional grid. We utilize a hidden
Markov model (HMM) to account for the time domain and relevance learning to identify
the feature dimensions that are most predictive over time. The learned mapping can be
used to visualize the temporal sequences and to predict the class of a new sequence. The
relevance learning permits the identification of discriminating masses or gene expressions
and prunes dimensions which are unnecessary for the classification task or encode mainly
noise. In this way we obtain a very efficient learning system for temporal sequences.

Results: The results indicate that using simultaneous supervised learning and metric
adaptation significantly improves the prediction accuracy for synthetic and real-life
data in comparison to the standard techniques. The discriminating features, identified by 
relevance learning, compare favorably with the results of alternative methods. Our method 
permits the visualization of the data on a low dimensional grid, highlighting the observed 
temporal structure. 

Contact: fschleif@techfak.uni-bielefeld.de

Keywords: high-dimensional time series, short time series, prototype learning, relevance learning,
topographic mapping
1 Introduction



The analysis of high-dimensional, short time series, or temporal sequences, is a challenging task.
On the one hand, the data are no longer independent and identically distributed (i.i.d.)
due to the time dependence; on the other hand, the dimensionality of the data is large, complicating
the learning of a predictive model. Standard time series methods like auto-regressive
moving average (ARMA) or extensions thereof (see e.g. [3]) are in general not applicable due to
the limited number of time points and the large dimensionality of the data. Some methods have
been proposed to model this type of data. In [21] an unsupervised projection technique was
proposed employing a so-called temporal context. The temporal data are processed by
a kind of Self-Organizing Map (SOM) [11], but the learning is modified such that it depends
on the current temporal context. A further unsupervised proposal was made in [14]
using the Generative Topographic Mapping Through Time (GTM-TT) [3]. New hidden
variables were introduced to account for the relevance of the different feature dimensions, that is, to
account, in a non-discriminative manner, for the explained variance in the data over time. A
supervised two-class method based solely on hidden Markov models was proposed in [13]. It
models the two data distributions by independent HMMs and evaluates the generated
models to obtain a ranking of the input dimensions. Subsequently the model was improved by
selecting a set of features using a wrapper strategy. In [6] a similar approach was proposed, but
in a semi-supervised scenario, introducing class-wise constraints in the hidden Markov model.
The importance of the individual features was determined using a complex post-processing
procedure. Another supervised method using all features, based on a Support Vector Machine
(SVM) and a Kalman filter, was proposed in [5].

While the first two approaches have been found to be very effective for unsupervised analysis,
the last-mentioned methods focus on supervised and semi-supervised analysis. The results in
[13] are very promising, with 85% prediction accuracy on a real-life multiple sclerosis (MS) data
set, but make strong assumptions about the underlying HMM structure. Also, the method is
proposed for two-class scenarios only. The approach in [5] improved this result by 2-5%, but
in a black-box scenario, without additional feature selection. The approach in [6] is also
evaluated with respect to the results of [13], achieving improved performance for the same MS data
set. Research in this field is still ongoing, reflecting the high demand for effective
methods dealing with this type of data. The application field is not limited to the bio-medical
domain as considered in 1131 111 IHI but covers a broader field of applications also in industry and 
geo-science as refiected in [T^ ^ ■ 

The identification of the relevant input dimensions of a temporal sequence is very important 
as outlined in [TU [13] to obtain better understanding of the data, to reduce the processing 
complexity and to improve the overall prediction accuracy. As already motivated by some of
the prior references, prototype methods (see e.g. [11]) have been found to be very effective for
the analysis of high-dimensional data and also of temporal sequences. In [3], the Generative
Topographic Mapping Through Time (GTM-TT), an unsupervised prototype-based method for
the topographic projection of high-dimensional temporal sequences, was proposed. GTM-TT
learns a hidden Markov model (HMM) of a data-generating process and represents the data by
a prototype-based representation in time and space. As in ordinary prototype methods, the
GTM-TT approximates the data distribution by a vector quantization of the data space. The
temporal dependence between the prototypes is modeled by an appropriate HMM. Additionally,
the prototypes are assigned to a fixed grid representation or lattice, which permits, provided the
topology is preserved (see [12]), easy visualization and interpretation of the data trajectory
in a low-dimensional space. In this contribution we extend the GTM-TT to a supervised method
and integrate relevance learning to identify the relevant dimensions over time. First we
briefly review Generative Topographic Mapping (GTM) and Generative Topographic Mapping
Through Time. Subsequently, we outline our method, apply it to different
experimental data, and discuss the results. The paper closes with links to further extensions and open questions.
Figure 1: GTM-TT consisting of an HMM in which the hidden states are given by the latent
points of the GTM model. The emission probabilities are governed by the GTM mixture
distribution. The different data distributions, exemplified in 3D (bottom) and indicated by
the color/shading, are mapped to the 2D grid (top). Here we consider 9 hidden states on a 3 x 3
grid. The data distribution may change over time, and hence the mapping of the GTM is also
affected over time, assuming smooth transitions.



2 Approach and Methods 

2.1 Generative Topographic Mapping 

The Generative Topographic Mapping (GTM) as introduced in [1] models data $\mathbf{x} \in \mathbb{R}^D$ by
means of a mixture of Gaussians which is induced by a lattice of points $\mathbf{w}$ in a low-dimensional
latent space which can be used for visualization.

The lattice points are mapped via $\mathbf{w} \mapsto \mathbf{t} = y(\mathbf{w}, \mathbf{W})$ to the data space, where the function
is parametrized by $\mathbf{W}$; one can, for example, pick a generalized linear regression model based
on Gaussian base functions

$$ y : \mathbf{w} \mapsto \Phi(\mathbf{w}) \cdot \mathbf{W} \qquad (1) $$

where the base functions $\Phi$ are equally spaced Gaussians. The high-dimensional points $y$ are so-called
prototypes of the original data space; each represents the larger set of data points it is responsible
for, as measured by Eq. (5). They can be directly inspected and permit to summarize the
data. Every latent point induces a Gaussian



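$$ p(\mathbf{x} \mid \mathbf{w}, \mathbf{W}, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\,\lVert \mathbf{x} - y(\mathbf{w}, \mathbf{W}) \rVert^2\right) \qquad (2) $$

with variance $\beta^{-1}$, which gives the data distribution as a mixture of $K$ modes

$$ p(\mathbf{x} \mid \mathbf{W}, \beta) = \sum_{k=1}^{K} p(\mathbf{w}^k)\, p(\mathbf{x} \mid \mathbf{w}^k, \mathbf{W}, \beta) \qquad (3) $$

where, usually, $p(\mathbf{w}^k)$ is taken as Dirac distributions of the prototypes. Training of GTM
optimizes the data log-likelihood

$$ \ln \prod_{n=1}^{N} \left( \sum_{k=1}^{K} p(\mathbf{w}^k)\, p(\mathbf{x}^n \mid \mathbf{w}^k, \mathbf{W}, \beta) \right) \qquad (4) $$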
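by means of an expectation maximization (EM) approach with respect to the parameters $\mathbf{W}$
and $\beta$. In the E step, the responsibility of mixture component $k$ for point $n$ is determined as

$$ r^{kn}(\mathbf{W}, \beta) = p(\mathbf{w}^k \mid \mathbf{x}^n, \mathbf{W}, \beta) = \frac{p(\mathbf{x}^n \mid \mathbf{w}^k, \mathbf{W}, \beta)\, p(\mathbf{w}^k)}{\sum_{k'} p(\mathbf{x}^n \mid \mathbf{w}^{k'}, \mathbf{W}, \beta)\, p(\mathbf{w}^{k'})} \qquad (5) $$

In the M step, the weights $\mathbf{W}$ are determined by solving the equality

$$ \Phi^T \mathbf{G}_{\mathrm{old}} \Phi\, \mathbf{W}_{\mathrm{new}}^T = \Phi^T \mathbf{R}_{\mathrm{old}} \mathbf{X} \qquad (6) $$

where $\Phi$ refers to the matrix of base functions evaluated at the points $\mathbf{w}^k$, $\mathbf{X}$ to the data
points, $\mathbf{R}$ to the responsibilities, and $\mathbf{G}$ is a diagonal matrix with accumulated responsibilities
$G_{kk} = \sum_n r^{kn}(\mathbf{W}, \beta)$. The variance can be computed by solving

$$ \frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k,n} r^{kn}(\mathbf{W}_{\mathrm{old}}, \beta_{\mathrm{old}})\, \lVert \Phi(\mathbf{w}^k)\, \mathbf{W}_{\mathrm{new}} - \mathbf{x}^n \rVert^2 \qquad (7) $$

where $D$ is the data dimensionality and $N$ the number of data points.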
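The E step (5) and the variance update (7) can be illustrated with a minimal Python sketch. It assumes a uniform prior $p(\mathbf{w}^k) = 1/K$ and takes the prototypes $y(\mathbf{w}^k, \mathbf{W})$ as given; the weight update (6) is omitted. The function names `responsibilities` and `update_beta` are hypothetical and not part of the original method description.

```python
import math


def responsibilities(X, Y, beta):
    """Responsibilities r^{kn} as in Eq. (5), assuming a uniform prior 1/K.

    X: list of N data vectors, Y: list of K prototypes y(w^k, W),
    beta: inverse variance. Returns R with R[k][n].
    """
    K, N = len(Y), len(X)
    R = [[0.0] * N for _ in range(K)]
    for n, x in enumerate(X):
        # Log of the unnormalized Gaussians; the factor (beta/2pi)^(D/2)
        # and the uniform prior cancel in the normalization.
        logs = [-0.5 * beta * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
                for y in Y]
        m = max(logs)  # subtract the maximum for numerical stability
        w = [math.exp(v - m) for v in logs]
        s = sum(w)
        for k in range(K):
            R[k][n] = w[k] / s
    return R


def update_beta(X, Y_new, R):
    """Inverse variance as in Eq. (7), using the updated prototypes Y_new.

    Raises ZeroDivisionError if every prototype coincides with its data.
    """
    N, D = len(X), len(X[0])
    total = 0.0
    for k, y in enumerate(Y_new):
        for n, x in enumerate(X):
            total += R[k][n] * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return N * D / total
```

Each column of the returned matrix sums to one, so every data point distributes a total responsibility of one over the $K$ mixture components.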



2.2 Relevance learning 

The principle of relevance learning has been introduced in [10] as a particularly simple and efficient
method to adapt the metric of prototype-based classifiers according to the given situation
at hand. It takes into account a relevance scheme of the data dimensions by substituting the
squared Euclidean metric by the weighted form

$$ d_{\lambda}(\mathbf{x}, \mathbf{t}) = \sum_{d=1}^{D} \lambda_d^2\, (x_d - t_d)^2. \qquad (8) $$

The principle is extended in [18, 17] to the more general metric form

$$ d_{\Omega}(\mathbf{x}, \mathbf{t}) = (\mathbf{x} - \mathbf{t})^T\, \Omega^T \Omega\, (\mathbf{x} - \mathbf{t}) \qquad (9) $$

Using a square matrix $\Omega$, the product $\Omega^T \Omega$ is positive semi-definite, which gives rise to a valid
pseudo-metric. In [18, 17], these metrics are considered in local and global form, i.e.
the adaptive metric parameters can be identical for the full model, or they can be attached to
every prototype present in the model. Relevance learning for GTM has already been introduced
in [7] for i.i.d. data. In the case of temporal sequences some modifications of the original principle
are necessary, and the supervision will also be handled differently, as pointed out subsequently.
First, however, we review the GTM through time as described in [3], which is the basic method
for processing temporal data in our approach.
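The two metric forms (8) and (9) can be written down directly; the following is a minimal sketch in plain Python with hypothetical function names `d_lambda` and `d_omega`. It also shows that a diagonal $\Omega$ reproduces the weighted Euclidean form.

```python
def d_lambda(x, t, lam):
    """Weighted squared Euclidean distance, Eq. (8)."""
    return sum((l ** 2) * (a - b) ** 2 for l, a, b in zip(lam, x, t))


def d_omega(x, t, omega):
    """Matrix distance (x - t)^T Omega^T Omega (x - t), Eq. (9).

    omega is given as a list of rows.
    """
    diff = [a - b for a, b in zip(x, t)]
    # Project the difference vector with Omega, then take the squared length.
    proj = [sum(w * d for w, d in zip(row, diff)) for row in omega]
    return sum(p * p for p in proj)
```

A relevance of zero removes a dimension from the distance entirely, which is how irrelevant or noisy dimensions are pruned.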



2.3 Generative Topographic Mapping Through-Time 

The GTM through time (GTM-TT) has been introduced in [3]. For data vectors $\mathbf{x}$ which have
the form of a time series, the vectors are no longer independent. Nearby time points are likely
to be correlated. As pointed out in [3], such effects can be captured using Hidden Markov
Models (HMM). Accordingly, in [3] the GTM is equipped with an HMM, constructing a kind of
topology-constrained HMM.

The structure of the GTM-TT is shown in Figure 1. Assuming a sequence of length $T$ of
hidden states $Z = \{z_1, \ldots, z_n, \ldots, z_T\}$ and the observed multidimensional time series
$X = \{\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^n, \ldots, \mathbf{x}^T\}$, the probability of the observations is given by

$$ p(X) = \sum_{\text{all sequences } Z} p(Z, X) \qquad (10) $$
where $p(Z, X)$ defines the complete data likelihood as in HMM models [4], taking the following
form:

$$ p(Z, X) = p(z_1) \prod_{n=2}^{T} p(z_n \mid z_{n-1}) \prod_{n=1}^{T} p(\mathbf{x}^n \mid z_n) \qquad (11) $$

So it consists of the initial state probability, the transition probability between two hidden
states, capturing the temporal dependence, and the probability of observing a specific data vector
in a given state, also known as the emission probability (covered by Eq. (2)). The model parameters
are $\Theta = (\pi, \mathbf{A}, \mathbf{W}, \beta)$, where $\pi = \{\pi_j\} : \pi_j := p(z_1 = j)$ are the initial state probabilities,
$\mathbf{A} = \{a_{ij}\} : a_{ij} = p(z_n = j \mid z_{n-1} = i)$ are the transition state probabilities, and $\{\mathbf{W}, \beta\}$ are
given by Eq. (2). Again we control the Gaussians by a common inverse variance $\beta$. As in HMMs, the
above likelihood can be efficiently calculated using the forward-backward procedure [22]. The
probability of being in state $\mathbf{w}^k$ at time $n$, given the observation sequence and the model, also
known as the responsibility $r^{kn}$, is calculated as:

$$ r^{kn} = p(z_n = \mathbf{w}^k \mid X, \Theta) = \frac{A_{kn} B_{kn}}{\sum_{k'=1}^{K} A_{k'n} B_{k'n}} \qquad (12) $$

The forward variable $A_{kn}$ is the joint probability of the past sequence $\{\mathbf{x}^1, \ldots, \mathbf{x}^n\}$ and the state
$z_n = \mathbf{w}^k$, i.e. $A_{kn} = p(\{\mathbf{x}^1, \ldots, \mathbf{x}^n\}, z_n = \mathbf{w}^k \mid \Theta)$, given by the following recursive equation:

$$ A_{kn} = \left[ \sum_{i=1}^{K} A_{i,n-1}\, a_{ik} \right] p_k(\mathbf{x}^n) \qquad (13) $$

where $A_{k1} = \pi_k\, p_k(\mathbf{x}^1)$ and $p_k(\mathbf{x}^n)$ denotes the emission probability of $\mathbf{x}^n$ in state $\mathbf{w}^k$. The backward
variable $B_{kn}$, which is the probability of the future sequence $\{\mathbf{x}^{n+1}, \mathbf{x}^{n+2}, \ldots, \mathbf{x}^T\}$ given the hidden
state $z_n = \mathbf{w}^k$, i.e. $B_{kn} = p(\{\mathbf{x}^{n+1}, \mathbf{x}^{n+2}, \ldots, \mathbf{x}^T\} \mid z_n = \mathbf{w}^k, \Theta)$, is calculated using the following recursive equation:

$$ B_{kn} = \sum_{i=1}^{K} a_{ki}\, p_i(\mathbf{x}^{n+1})\, B_{i,n+1} \qquad (14) $$

where $B_{kT} = 1$. The whole parameter estimation can be accomplished by a maximum likelihood
optimization using the EM algorithm as sketched above. Details can be found in [19].
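The forward and backward recursions (13) and (14) and the responsibilities (12) can be sketched as follows. This is an unscaled illustration in plain Python, which is adequate for the short sequences considered here but would need rescaling or log-space arithmetic for long sequences. The function name `forward_backward` and the argument layout are assumptions made for this example; the emission probabilities are taken as precomputed.

```python
def forward_backward(pi, A, E):
    """Unscaled forward-backward pass for one sequence.

    pi[k]   : initial state probability of state k
    A[i][j] : transition probability from state i to state j
    E[k][n] : emission probability p_k(x^n) of observation n in state k
    Returns (fwd, bwd, resp, lik), where lik is the sequence likelihood.
    """
    K, T = len(pi), len(E[0])
    fwd = [[0.0] * T for _ in range(K)]
    bwd = [[1.0] * T for _ in range(K)]  # B_{kT} = 1
    for k in range(K):
        fwd[k][0] = pi[k] * E[k][0]      # A_{k1} = pi_k p_k(x^1)
    for n in range(1, T):                # forward recursion, Eq. (13)
        for k in range(K):
            fwd[k][n] = sum(fwd[i][n - 1] * A[i][k] for i in range(K)) * E[k][n]
    for n in range(T - 2, -1, -1):       # backward recursion, Eq. (14)
        for k in range(K):
            bwd[k][n] = sum(A[k][i] * E[i][n + 1] * bwd[i][n + 1]
                            for i in range(K))
    lik = sum(fwd[k][T - 1] for k in range(K))
    # Responsibilities, Eq. (12); the normalizer equals lik for every n.
    resp = [[fwd[k][n] * bwd[k][n] / lik for n in range(T)] for k in range(K)]
    return fwd, bwd, resp, lik
```

The returned likelihood is the sum of the forward variables at the final time point, which is the quantity reused later when sequences are compared across class-wise models.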



2.4 Supervised GTM-TT 

Assume that each data sequence $X$ is equipped with label information $l$, which is an element of a finite
set of different labels $L$, e.g. $L = \{0, 1\}$. Let us assume we have only two labels. The data
are divided into two groups according to the labeling, and we train one separate GTM-TT per
group. To keep the models comparable, the $\beta$ updates of the models are linked to each other and
optimized in the outer loop. The parameters $\mathbf{W}$ are determined for each model individually,
leading to $\mathbf{W}_0$ and $\mathbf{W}_1$. We further assume that the grid structure is common to both
models. The learning procedure is thus similar to GTM-TT and is given in Algorithm 1.



We denote the obtained model as Supervised GTM-TT (SGTM-TT) and the submodels as $M_0$
and $M_1$. The concept of the SGTM-TT is depicted schematically in Figure 2.



2.5 Classification using SGTM-TT 

To classify new data points with the SGTM-TT model different approaches can be taken. The 
simplest one is to make direct use of the samplewise likelihoods considering the class wise 
models. In that case a new point is assigned to the model with maximal likelihood considering 
one model against the rest. A more interesting approach is to combine the performance of the 

^An extension to multiple labels is straightforward.
Algorithm 1 Pseudocode of supervised GTM-TT

function Supervised GTM-TT(X, L, K)
    [Xn, Pars] = normalize(X)
    [X1, X2, L1, L2] = splitdata(Xn, L)
    [M0, M1] = init both GTM-TT models
    repeat
        call train_single_step for M0, M1
        call convergence_check for M0, M1
        call optimize_beta for M0, M1
        β = calculate mean of the β
        call update_beta(M0, M1, β)
    until convergence is true for both models
end function
Figure 2: Illustration of the SGTM-TT. It consists of multiple GTM-TT models. It behaves
similarly to the regular GTM-TT, but the training is class-wise and the $\beta$ parameter is common
to the different models. The different class-wise models (top) are used to represent the data
distribution (bottom) over time (from left to right).



generative SGTM-TT model with a discriminative approach like the SVM. Again we use
the likelihood values from the forward procedure (13) of the SGTM-TT and define a kernel as
follows:

$$ \mathrm{Lik}^{l}_{j} = \sum_{i=1}^{K} A^{l}_{iT} \quad \text{for a series } j \text{ and a sub-model } l \qquad (15) $$

$$ K(X_j, X_k) = \sum_{l=1}^{\#L} \mathrm{Lik}^{l}_{j} \cdot \mathrm{Lik}^{l}_{k} \quad \text{with equal prior} \qquad (16) $$

Here $A^{l}_{iT}$ is the forward variable of sub-model $l$ at the final time point $T$, evaluated on series $j$.
Hence the kernel $K$ is an inner product in the feature space of the likelihood values, with one
dimension per sub-model. In the following we will make use of this approach employing a
standard SVM implementation.
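As an illustration of (15) and (16), the following sketch builds the kernel (Gram) matrix from a table of per-model sequence likelihoods. The function name `likelihood_kernel` and the input layout are assumptions made for this example; the likelihood values themselves would come from the forward procedure of each class-wise model.

```python
def likelihood_kernel(lik):
    """Gram matrix of inner products between likelihood vectors, Eq. (16).

    lik[j][l] is the likelihood of series j under sub-model l, Eq. (15).
    """
    n = len(lik)
    return [[sum(a * b for a, b in zip(lik[j], lik[k])) for k in range(n)]
            for j in range(n)]
```

The resulting matrix is symmetric and positive semi-definite by construction, so it can be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. For very small likelihood values, working with rescaled or log-transformed likelihoods may be preferable; that variation is not part of the description above.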



2.6 Relevance learning for SGTM-TT 

Relevance learning for GTM has been introduced in [7] as the Relevance GTM (R-GTM). The
basic idea of Relevance GTM is to introduce an adaptive metric for the GTM. The original
Euclidean metric is replaced by a parametric distance like the weighted Euclidean metric (8).
After each GTM training step the prototypes are post-labeled according to their responsibilities,
employing the labeling $L$ of the data points. Subsequently the metric parameters of the distance
are adapted according to an optimization criterion. In [7], different cost functions
$E$ were suggested.

The data of the GTM-TT are no longer i.i.d. and, as mentioned before, we observe
a sequence of states $Z$ for a given time series $X$. In the SGTM-TT we know the labeling of
the prototypes, assuming constant labels over time, due to the split of the learning problem
according to the data labeling. Further, using a common metric and a common $\beta$ parameter,
the prototypes $Y$ still exist in the same common data space. Relevance learning can now be
done in the same way as for R-GTM. This, however, is often not useful because the original
relevance learning ignores the time domain. If data separation is observed over time and not
for a single time point, the R-GTM approach will fail. For temporal sequences we may also
be interested in two views of relevance, namely relevant, or separating, input dimensions $x_i$
but also relevant time points in a temporal sequence $\mathbf{x}$. Taking this problem into account, we
consider two distance measures, one for the time domain, denoted as $d^t$, and one for the
time-independent data space, $d^{\lambda}$. A parametrization of $d^t$ can be used to account for the relevance of
specific time points, e.g. to prune out time points which are irrelevant for the representation of
the data in a discriminative manner. Parameters on $d^{\lambda}$ can be used to identify discriminating
feature dimensions, e.g. to prune out noisy dimensions. Subsequently, we provide a distance
measure which can be used for $d^t$ and a specific form for $d^{\lambda}$. For simplicity we will use a simple
global diagonal metric learning scheme in the experiments.

SGTM-TT provides a probabilistic prediction of the internal representation $\hat{\mathbf{x}}$ of a time
series $\mathbf{x}$. Considering the two GTM-TT models, we obtain one reconstruction each:

$$ \hat{x}^{n}_{l,i} = Y_l\left(\arg\max_{k}\, r_l^{kn},\; i\right) \quad \forall i \in [1, D], \quad \text{with } l \in \{0, 1\} $$

Now, two distances are calculated over time for each point and each dimension $i$: $d^t(\hat{\mathbf{x}}_{0,i}, \mathbf{x}_i)$ and
$d^t(\hat{\mathbf{x}}_{1,i}, \mathbf{x}_i)$. Using one of the cost functions suggested in [7], we can calculate the
relevance of the individual dimensions for the separation between the two reconstructions per
point and hence between the different models.
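The reconstruction step can be sketched as a winner-takes-all lookup: for every time point, the prototype of the hidden state with the largest responsibility is taken. The helper name `reconstruct` and the list-based data layout are assumptions for this illustration.

```python
def reconstruct(Y, R):
    """Reconstruct a sequence from one class-wise model.

    Y[k]    : prototype of hidden state k (list of D values)
    R[k][n] : responsibility of state k at time point n
    Returns a list of T reconstructed vectors, one per time point.
    """
    K, T = len(R), len(R[0])
    sequence = []
    for n in range(T):
        # Winning state at time n (arg max over k of r^{kn}).
        k_best = max(range(K), key=lambda k: R[k][n])
        sequence.append(list(Y[k_best]))
    return sequence
```

Applying this once per class-wise model yields the two reconstructions that are then compared with the original sequence, dimension by dimension, using the time-domain distance.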

As for R-GTM, the metric adaptation is done by an appropriate optimization scheme on
the cost functions; here we use stochastic gradient descent with a fixed learning rate
$\epsilon = 0.1$. To avoid convergence to trivial optima such as zero, we pose constraints on the metric
parameters of the form $\lVert \lambda \rVert = 1$ or $\mathrm{trace}(\Omega^T \Omega) = 1$ for matrix learning. This is achieved by
normalization of the values, i.e. after every gradient step, $\lambda$ is divided by its length, and $\Omega$ is
divided by the square root of $\mathrm{trace}(\Omega^T \Omega)$.
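The gradient step followed by normalization can be sketched as below. The function names `metric_step_diagonal` and `normalize_omega` are hypothetical, and the gradient is assumed to be supplied by the chosen cost function.

```python
import math


def metric_step_diagonal(lam, grad, eps=0.1):
    """One gradient descent step on lambda, then rescale to unit length."""
    stepped = [l - eps * g for l, g in zip(lam, grad)]
    norm = math.sqrt(sum(v * v for v in stepped))
    return [v / norm for v in stepped]


def normalize_omega(omega):
    """Rescale Omega so that trace(Omega^T Omega) = 1.

    trace(Omega^T Omega) equals the sum of all squared entries of Omega.
    """
    scale = math.sqrt(sum(v * v for row in omega for v in row))
    return [[v / scale for v in row] for row in omega]
```

Because of the rescaling, only the relative sizes of the relevance values carry information, which makes relevance profiles from different runs directly comparable.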

A pseudocode of the SGTM-TT with relevance learning is given in Algorithm 2.

Usually, we alternate between one EM step, one epoch of gradient descent, and normalization
in our experiments, and start the metric learning after 10 epochs of EM learning to allow a
reasonable pre-positioning of the GTM-TT in the data space. The metric learning is annealed by
$\epsilon$. Since EM optimization is much faster than gradient descent, we can enforce in this way that the
metric parameters are adapted on a slower time scale. Hence we can assume an approximately
constant metric for the EM optimization, i.e. the EM scheme optimizes the likelihood as before.
Metric adaptation takes place considering quasi-stationary states of the GTM solution due to
the slower time scale. The call of train_single_step is a regular EM optimization step of the
GTM-TT but without the adaptation of the parameter $\beta$, which is postponed to allow a linking
between the two GTM-TT models included in the SGTM-TT.

Now, we briefly review a concrete cost function $E$ of the relevance GTM for the metric
adaptation, as already introduced in [7], but account for the alternative distance calculations
mentioned before.

Cost function - Generalized Relevance GTM (GRGTM)

Metric parameters have the form $\lambda$ or $\lambda^k$ for a diagonal metric (8) and $\Omega$ or $\Omega^k$ for a full matrix
(9), depending on whether a global or local scheme is considered. In the following, we define the
Algorithm 2 Pseudocode of supervised GTM-TT with relevance learning

function SGTM-TT-R(X, L, K)
    [Xn, Pars] = normalize(X)
    [X1, X2, L1, L2] = splitdata(Xn, L)
    initialize the common metric
    [M0, M1] = init both GTM-TT models
    repeat
        call train_single_step for each GTM-TT model
        call convergence_check for each GTM-TT model
        if cycle > 10 then
            ∀X, ∀i = 1:D call reconstruct(X_i, M0, M1)
            ∀X, ∀i = 1:D call d^t(M0, x̂_i0, x_i)
            ∀X, ∀i = 1:D call d^t(M1, x̂_i1, x_i)
            ∀X call calculate_metric_update
            average the metric updates and normalize
            update the metric parameter annealed by ε
        end if
        call optimize_beta for each GTM-TT model
        β = calculate mean of the β
        call update_beta(M0, M1, β)
    until convergence is true for both models
end function



general parameter $\Theta^k$ which can be chosen as one of these four possibilities depending on the
given setting. Thereby, we can assume that $\Theta^k$ can be realized by a matrix which has diagonal
form (for relevance learning) or full matrix form (for matrix updates).

The cost function of generalized relevance GTM is taken from generalized relevance learning
vector quantization (GRLVQ), which can be interpreted as maximizing the hypothesis margin
of a prototype-based classification scheme [10, 14]. The cost function has the form

$$ E(\Theta) = \sum_n E_n(\Theta) = \sum_n \mathrm{sgd}\left( \frac{d_{\Theta^+}(\mathbf{x}^n, \hat{\mathbf{x}}^n_+) - d_{\Theta^-}(\mathbf{x}^n, \hat{\mathbf{x}}^n_-)}{d_{\Theta^+}(\mathbf{x}^n, \hat{\mathbf{x}}^n_+) + d_{\Theta^-}(\mathbf{x}^n, \hat{\mathbf{x}}^n_-)} \right) \qquad (17) $$

where $\mathrm{sgd}(x) = (1 + \exp(-x))^{-1}$, $\hat{\mathbf{x}}^n_l$ is the reconstruction of $\mathbf{x}^n$ over time using the model
$M_0$ or $M_1$ depending on the label of $\mathbf{x}$, $+$ indicates the model with the same label, and $-$ the model
with a different label or the model for the remaining data.

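The adaptation formulas can be derived thereof by taking the derivatives with respect to
the metric parameters. Depending on the form of the metric, the derivative of the metric is
simply

$$ \frac{\partial d_{\lambda}(\mathbf{x}, \hat{\mathbf{x}}^l)}{\partial \lambda_i} = 2 \lambda_i\, d^t(\mathbf{x}_i, \hat{\mathbf{x}}^l_i)^2 \qquad (18) $$

for a diagonal metric and

$$ \frac{\partial d_{\Omega}(\mathbf{x}, \hat{\mathbf{x}}^l)}{\partial \Omega_{ij}} = 2\, d^t(\mathbf{x}_j, \hat{\mathbf{x}}^l_j) \sum_{d} \Omega_{id}\, d^t(\mathbf{x}_d, \hat{\mathbf{x}}^l_d) \qquad (19) $$

for a full matrix.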

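For simplicity, we denote the respective squared distances to the closest correct and wrong
model by $d^+ = d_{\Theta^+}(\mathbf{x}^n, \hat{\mathbf{x}}^n_+)$ and $d^- = d_{\Theta^-}(\mathbf{x}^n, \hat{\mathbf{x}}^n_-)$, respectively. The term $\mathrm{sgd}'$ is a shorthand
notation for $\mathrm{sgd}'((d^+ - d^-)/(d^+ + d^-))$. Given a data point $\mathbf{x}^n$, the derivative of the corresponding
summand of the cost function $E$ with respect to the metric parameters yields

$$ \frac{\partial E_n}{\partial \Theta^+} = \mathrm{sgd}' \cdot \frac{2 d^-}{(d^+ + d^-)^2} \cdot \frac{\partial d^+}{\partial \Theta^+} \qquad (20) $$

for the parameters of the closest correct prototype and

$$ \frac{\partial E_n}{\partial \Theta^-} = -\,\mathrm{sgd}' \cdot \frac{2 d^+}{(d^+ + d^-)^2} \cdot \frac{\partial d^-}{\partial \Theta^-} \qquad (21) $$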
Figure 3: Illustration of the $L_p^{fc}$-norm, with panels (a) Two functions: Euc = $L_p^{fc}$-norm and
(b) Two functions: Euc ≠ $L_p^{fc}$-norm. Plot (a) indicates the case in which the distance between
two functions is equal for the Euclidean and the $L_p^{fc}$-norm. In plot (b) parts of the functions are
interchanging (crossing). The distance using Euc is still the same as in plot (a), but for the
$L_p^{fc}$-norm the distance is changed, giving a more realistic measure of the distance between the two
functions.



for the parameters attached to the closest wrong model. All other parameters are not affected. 
As pointed out before we choose only a global metric such that the update corresponds to the 
sum of these two derivatives. 
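For a global diagonal metric, the cost summand and the sum of the two derivatives can be sketched as follows. The sketch assumes that the per-dimension time distances to the correct and the wrong reconstruction are precomputed, and that the metric combines them as $d = \sum_i \lambda_i^2 (d^t_i)^2$, in line with (8) and (18). The names `sgd`, `cost_and_gradient`, `dist_plus` and `dist_minus` are chosen for this illustration only.

```python
import math


def sgd(a):
    """Logistic sigmoid used in the cost function."""
    return 1.0 / (1.0 + math.exp(-a))


def cost_and_gradient(lam, dist_plus, dist_minus):
    """Cost summand E_n and its gradient for a global diagonal metric.

    dist_plus[i]  : time distance in dimension i to the correct model
    dist_minus[i] : time distance in dimension i to the wrong model
    """
    d_plus = sum(l * l * dp * dp for l, dp in zip(lam, dist_plus))
    d_minus = sum(l * l * dm * dm for l, dm in zip(lam, dist_minus))
    s = sgd((d_plus - d_minus) / (d_plus + d_minus))
    ds = s * (1.0 - s)                    # derivative of the sigmoid
    denom = (d_plus + d_minus) ** 2
    grad = []
    for l, dp, dm in zip(lam, dist_plus, dist_minus):
        dd_plus = 2.0 * l * dp * dp       # derivative of d^+ w.r.t. lambda_i
        dd_minus = 2.0 * l * dm * dm      # derivative of d^- w.r.t. lambda_i
        # Sum of the contributions of the correct and the wrong model.
        grad.append(ds * (2.0 * d_minus / denom * dd_plus
                          - 2.0 * d_plus / denom * dd_minus))
    return s, grad
```

A negative gradient entry means that gradient descent increases the relevance of that dimension; this happens for dimensions in which the wrong model reconstructs the sequence clearly worse than the correct one.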



Distance measure for functional data 

Here we consider a functional distance measure as an extension of the $L_p$ norm proposed in
[12], subsequently denoted as FUNC. The functional distance measure has the advantage
of taking the functional nature of the data into account, or in our case the dependence over
time, which also constitutes a function $f(t)$ with potentially discrete arguments $t$. It has
already been used successfully for the analysis of biomedical data, as shown in [16]. The standard
Euclidean distance treats the individual features of a signal as independent, so that a change
in the order of the positions does not affect the calculated distance. However, the measurement
points over time are not independent, so that a distance taking this aspect into account can
be considered more appropriate for this type of data. Lee proposed a distance measure
taking the functional structure into account by involving the previous and next values of a
signal $v_k$ in the $k$-th term of the sum, instead of $v_k$ alone. Assuming a constant sampling period
$\tau$, the proposed norm (FUNC) is:



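$$ L_p^{fc}(\mathbf{v}) = \left( \sum_{k=1}^{D} \left( A_k(\mathbf{v}) + B_k(\mathbf{v}) \right)^p \right)^{1/p} \qquad (22) $$

with

$$ A_k(\mathbf{v}) = \begin{cases} \frac{\tau}{2}\, |v_k| & \text{if } 0 \le v_k v_{k-1} \\ \frac{\tau}{2}\, \frac{v_k^2}{|v_k| + |v_{k-1}|} & \text{if } 0 > v_k v_{k-1} \end{cases} \qquad (23) $$

$$ B_k(\mathbf{v}) = \begin{cases} \frac{\tau}{2}\, |v_k| & \text{if } 0 \le v_k v_{k+1} \\ \frac{\tau}{2}\, \frac{v_k^2}{|v_k| + |v_{k+1}|} & \text{if } 0 > v_k v_{k+1} \end{cases} \qquad (24) $$

representing the triangles on the left and right sides of $v_k$, with $D$ being the data dimensionality.
For the data considered in this paper, $\mathbf{v}$ is a time series or a prototype reconstruction. As for
$L_p$, the value of $p$ is assumed to be a positive integer. At the left and right extremes of the
sequence, $v_0$ and $v_{D+1}$ are assumed to be equal to zero. The concept of the $L_p^{fc}$-norm is shown
in Figure 3. The calculation of this norm is slightly more complex than that of the standard
Euclidean norm.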
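A minimal sketch of the functional norm (22) to (24) is given below. The function names and the default values $p = 2$ and $\tau = 1$ are assumptions for this illustration; the distance between two sequences is taken as the norm of their difference.

```python
def _triangle(vk, neighbor, tau):
    """Area term A_k or B_k for one side of v_k, Eqs. (23) and (24)."""
    if vk * neighbor >= 0:
        return 0.5 * tau * abs(vk)
    # Sign change between v_k and its neighbor: only the part of the
    # triangle up to the zero crossing is counted.
    return 0.5 * tau * vk * vk / (abs(vk) + abs(neighbor))


def functional_lp_norm(v, p=2, tau=1.0):
    """Functional L_p norm of a sequence v, Eq. (22)."""
    padded = [0.0] + list(v) + [0.0]      # v_0 = v_{D+1} = 0
    total = 0.0
    for k in range(1, len(v) + 1):
        left = _triangle(padded[k], padded[k - 1], tau)
        right = _triangle(padded[k], padded[k + 1], tau)
        total += (left + right) ** p
    return total ** (1.0 / p)


def functional_distance(x, y, p=2, tau=1.0):
    """Functional distance between two sequences of equal length."""
    return functional_lp_norm([a - b for a, b in zip(x, y)], p, tau)
```

For a difference sequence without sign changes, such as `[1, 1]`, the result coincides with the Euclidean norm for $p = 2$ and $\tau = 1$. For a crossing, such as `[1, -1]`, the value is smaller, which mirrors the two cases contrasted in Figure 3.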
2.7 Data set description

Subsequently we consider two data sets to evaluate our approach. 

2.7.1 Simulated data sets 

The first one is a simulated two-class scenario, proposed in [15]. It consists of 100
samples divided into two classes of 50 samples each. For each sample 100 features have been
generated with 8 time points. Out of the 100 features, only 10 were substantially differentiating
between the classes. The generation mechanism behind the simulated data is to sample the time
series from a piecewise linear function. At a later step, sample-specific variation is included by
shrinking and expanding the curves.

2.7.2 Multiple sclerosis data 

The second data set is taken from [2] (IBIS) in the prepared form given in [6]. The data are
taken from a clinical study analyzing the response of multiple sclerosis (MS) patients to the
treatment. Blood samples enriched in mononuclear cells from 52 relapsing-remitting MS
patients were obtained 0, 3, 6, 7, 12, 18 and 24 months after initiation of IFNβ therapy. This
resulted in an average of 7 measurements across the 2 years. Expression profiles were obtained
using one-step kinetic reverse-transcription PCR over 70 genes selected by the specialists to
be potentially related to IFNβ treatment. Overall, 8% of the measurements were missing due
to patients missing their appointments. After the two-year endpoint, patients were classified as
either good or bad responders, depending on strict clinical criteria. Bad responders were defined
as having suffered two or more relapses or having a confirmed increase of at least one point
on the expanded disability status scale (EDSS). A good responder was required to show a suppression
of relapses and no increase on the EDSS. From the 52 patients, 33 were
classified as good and 19 as bad responders. A more detailed description of the data set is
available in [2] and the supplementary material therein.

3 Results and Discussion 

For the simulated and the MS data set, we analyzed the classification accuracy of the SGTM-TT
with 9 hidden states and 4 basis functions. The analysis was done within a 4-fold cross-validation
with 5 repetitions. We compared it with the general HMM classifier (HMM-Lin)
and the discriminative HMM classifier (HMM-Disc-Lin) proposed in [13]. We also included the
results of [2], who originally proposed the MS study, the analysis of [5], employing a Kalman
filter combined with an SVM approach, and [6], proposing a semi-supervised analysis coupled
with a wrapper and cut-off technique to identify discriminating features.

3.1 Simulated data 

We applied SGTM-TT with relevance learning to the simulated data set of [13]. We observed
an overall prediction accuracy of 94 ± 4%. The relevance profile identified all 10 known features
and effectively pruned out the remaining irrelevant data dimensions. Our results are slightly
better than those reported in [13] (90%) and by (92%).

3.2 Multiple sclerosis experiment 

In Table 1 we have summarized the prediction (test-set) results for the classification of the MS
data set in comparison to the results given in [2]. The obtained mappings of the SGTM-TT
are topology preserving, and we analyzed the mapping of the points to their prototypes and the
neighborhoods. The map for the first class is depicted for two temporal sequences in Figure 5.

^In our observations the topographic error was reasonably small.
Figure 4: Relevance profile as obtained using SGTM-TT with relevance learning. The plot
shows the average relevance (blue/dark), the minimal relevance (green/bright) and the standard
deviation of the relevance, flipped to the negative part of the relevance axis. We observe that
the standard deviation is relatively small; hence the relevance profiles of different runs are very
stable. The most discriminative features (high relevance) can in part also be found by the
alternative methods (see Table 2), but some additional features appear to be relevant, and our
proposed set consists of 7 genes rather than 17 as in [6].




Figure 5: Illustration of the 3x3 SGTM-TT mapping for the responder class. Plots in the
first row show typical state sequences. Even if the state sequences $Z$ are not identical, we can
expect the underlying signals $X$ to be similar due to their close neighborhood on the map.
This is reflected by the clustered signals at the bottom. The start of a sequence is indicated
by □ and the termination state by o.
Method        Number of genes   Test accuracy (%)
SGTM-TT       70                85.66 ± 8.3
SGTM-TT-R     7                 93.43 ± 5.8
IBIS          3                 74.20
Kalman-SVM    -                 87.80
Lin-Best      7                 85.00
Costa-Best    17                92.70 ± 6.1



Table 1: Prediction accuracies on the test data for different models using the MS data set. We
observe improved prediction accuracy employing feature selection. This is also true for SGTM-TT,
which improved by ≈ 6% using relevance learning and the SVM classifier. Interestingly,
the prediction accuracy on the full data set, including all features and without relevance
learning, is also quite good with nearly 84% and hence close to the best result proposed in [15].



Genes        Relevance   Found by Lin (7)   Found by Costa (17)
MAP3K1       0.3014      X                  X
NFkBIB       0.2609
IRF8         0.2584                         X
Caspase 10   0.2471      X                  X
Jak2         0.1869      X                  X
FLIP         0.1842
RIP          0.1647







Table 2: Most relevant genes using SGTM-TT with relevance learning. 



As expected, results improved by integration of feature selection or relevance learning compared
to the full feature set. Overall the SGTM-TT with relevance learning performed very well
and achieved good results of 92.5% with respect to the best reported model, with a smaller
number of necessary features. Further, the integrated relevance learning avoids multiple, time-consuming
runs within a wrapper approach as required for the techniques used in [13, 6]. The obtained
relevance profile is depicted in Figure 4 and provides direct access to an interpretation of the
relevant features, or marker candidates, pruning irrelevant or noisy dimensions. The values of
the relevance profile are roughly Gaussian distributed with $\mu = 0.1$. We calculate a threshold $C$
for the most relevant features using $C = \mu + \sigma$ and obtain the 7 most relevant features, summarized
in Table 2.
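The thresholding rule $C = \mu + \sigma$ can be sketched as follows. The function name `select_relevant`, the use of the population standard deviation, and the toy relevance values in the usage note are assumptions made for illustration; they are not the values of the actual relevance profile.

```python
import statistics


def select_relevant(relevance, names):
    """Return the threshold C = mu + sigma and the features above it.

    The population standard deviation is used here; which estimator was
    used for the reported results is not stated, so this is an assumption.
    """
    mu = statistics.mean(relevance)
    sigma = statistics.pstdev(relevance)
    c = mu + sigma
    selected = [name for name, r in zip(names, relevance) if r > c]
    return c, selected
```

With the toy profile `[0.1, 0.1, 0.1, 0.5]` and names `["a", "b", "c", "d"]`, only feature `"d"` lies above the threshold.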

The SGTM-TT also inherently models different subgroups by the probabilistic regularizing
model of the GTM and GTM-TT [1, 3]. Hence the model complexity is not so critical provided
the map is reasonably large. This is an advantage over the approach presented in [6], which
has the number of groups as an additional meta parameter.

4 Conclusion 

We have presented a theoretically sound approach for the analysis of short temporal sequences.
It is based on the novel idea of introducing supervision and relevance learning into the Generative
Topographic Mapping Through Time. Our results show that we are able to achieve improved or
similar performance compared to alternative methods for the simulated and the MS data set. Further, the
prototype concept of the underlying method permits a better understanding of the model and
extended visualization capabilities. We also obtain a direct ranking of the individual features
by means of the relevance profile, rather than by use of wrapper techniques. In future work we will
explore more advanced metric adaptation schemes and alternative functional distance measures.
Further, we would like to apply our approach to non-clinical data and make it more flexible with
respect to missing values.

^We would like to stress that due to the small sample size and the 4-fold cross-validation a misclassification
of 1 point accounts for an error of about 8%.